New implementation of hybridization expansion quantum impurity solver based on 

Newton-Leja interpolation polynomial 
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We introduce a new implementation of hybridization expansion continuous time quantum impu- 
rity solver which is relevant to dynamical mean-field theory. It employs Newton interpolation at 
a sequence of real Leja points to compute the time evolution of the local Hamiltonian efficiently. 
Since the new algorithm avoids not only computationally expansive matrix-matrix multiplications 
in conventional implementations but also huge memory consumptions required by Lanczos/Arnoldi 
iterations in recently developed Krylov subspace approach, it becomes advantageous over the pre- 
vious algorithms for quantum impurity models with five or more bands. In order to illustrate the 
great superiority and usefulness of our algorithm, we present realistic dynamical mean-field results 
for the electronic structures of representative correlated metal SrVC>3. 

PACS numbers: 02.70.Ss, 71.10.Fd, 71.30.+h, 71.10.Hf 



I. INTRODUCTION 

The rapid development of efficient numerical and ana- 
lytical methods for solving quantum impurity models has 
been driven in recent years by the great success of dy- 
namical mean-field theory (DMFT)ii2 and its non-local 
extensions^ In the framework of DMFT, the momentum 
dependence of self-energy is neglected, then the solution 
of general lattice model may be obtained from the solu- 
tion of an appropriately defined quantum impurity model 
plus a self-consistency condition. Both the non-local ex- 
tensions of DMFTl and realistic DMFT (i.e, local den- 
sity approximation combined with dynamical mean-field 
theory, LDA+DMFT) calculations f 2 - involve multi-site or 
multi-orbital quantum impurity models, whose solutions 
are computationally expensive and in practice the bottle- 
neck of the whole calculations. Therefore it is of crucial 
importance to develop fast, reliable, and accurate quan- 
tum impurity solvers. 

The multi-site or multi-orbital nature of the most rele- 
vant quantum impurity models favors quantum Monte 
Carlo methods. In the past two decades, perhaps 
the most commonly used impurity solver is the well- 
known Hirsch-Fye quantum Monte Carlo (HFQMC) 
algorithm^— This solver is numerically exact, but com- 
putationally expensive. Furthermore it suffers inevitable 
systematic error which is introduced by time discretiza- 
tion procedure, and thus is not suitable for solving the 
quantum impurity models under low temperature. 

Very recently, important progresses have been achieved 
with the development of various continuous time quan- 
tum Monte Carlo (CTQMC) impurity solvers, which are 
based on stochastic sampling of diagrammatic expansion 
of the partition function."— According to the differences 
in perturbation expansion terms, these CTQMC quan- 
tum impurity solvers can be classified into two types: 
weak coupling (also named as interaction expansion) and 



strong coupling (also named as hybridization expansion) 
implementations. The weak coupling CTQMC impurity 
solver was first proposed by Rubtsov et al.f£& who ex- 
panded the partition function in the interaction terms. 
This is the method of choice for cluster calculations of 
relatively simple models at small interactions, because 
the computational effort scales as the cube of the sys- 
tem size. As an useful complement, Werner and Mil- 
lis proposed2ii£ another powerful and flexible CTQMC 
impurity solver, which is based on a diagrammatic ex- 
pansion in the impurity-bath hybridization and the local 
interactions are treated exactly. Since this algorithm per- 
turbs around an exactly solved atomic limit, it is partic- 
ularly efficient at moderate and strong interactions. Fur- 
thermore, due to its ability to provide the information 
about atomic states?^ hybridization expansion quantum 
impurity solver is the desirable tool for LDA+DMFT 
calculations for strongly correlated materials. However, 
since the Hilbert space of the local problems grows expo- 
nentially with the number of sites or orbitals, the compu- 
tational effort scales exponentially, rather than cubically 
with system size. Hence the applications of hybridiza- 
tion expansion quantum impurity solver in LDA+DMFT 
calculations are severely constrained, especially for those 
multi-orbital impurity models with rotationally invariant 
interaction terms. 

With this obstacle in mind, in this paper we present 
a new efficient implementation for hybridization expan- 
sion quantum impurity solver, which enables reliable and 
fast simulations for multi-orbital models with up to seven 
orbitals on modern computer clusters or GPU-enable 
workstations. The rest of this paper is organized as fol- 
lows: In Sec[TT] a brief introduction to the conventional 
implementation^!!! and newly developed Krylov sub- 
space approach^ for the hybridization expansion quan- 
tum impurity solver in general matrix formalism is pro- 
vided. In Sec lIIII the new algorithm based on Newton 
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interpolation at real Leja points (for simplicity, in the 
following of this paper we just name it as Newton-Leja 
interpolation or Newton-Leja algorithm) is presented in 
details. Then in Sec lIVI we discuss the truncation ap- 
proximation, accuracy, and performance issues of the new 
algorithm and compare it with competitive Krylov sub- 
space approach. In Sec|V] the LDA+DMFT calculated 
results for typical strongly correlated metal SrVC>3 by 
using the Newton-Leja algorithm as an impurity solver 
are illustrated. Section I VII serves as a conclusion and 
outlook. In Appendix, concise introductions for the so- 
called Newton interpolation and Leja points are available 
as well. 



II. HYBRIDIZATION EXPANSION QUANTUM 
IMPURITY SOLVER 

The general quantum impurity model to be solved in 
the framework of single site DMFT can be given by the 
following Hamiltonian-ii^ 



H 



hyb 



H 



bath 



Hi, 



(i) 



Here a labels orbital, a labels spin, [i is the chemical 
potential, A a is the energy level shift for orbital a from 
a crystal field splitting, and n a ^ is the occupation op- 
erator. Hhyb, Hbath, and Hint denote impurity-bath hy- 
bridization, bath environment, and interaction terms, re- 
spectively. 

The basic idea of CTQMC impurity solvers is very 
simple. 5 ' 6 One begins from a general Hamiltonian H = 
H a + Hb, which is split into two parts labeled by a and 6, 
writes the partition function Z = Tie~P H in the interac- 
tion representation with respect to H a , and expands in 
powers of Hb, thus 



r i \k k tP k 
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(2) 



Here T is the time-ordering operator. The trace Tr[...] 



evaluates to a number and diagrammatic Monte Carlo 
method enables a sampling over all orders k, all topolo- 
gies of paths and diagrams, and all times ti, • • • , Tk in 
the same calculation. Because this method is formulated 
in continuous time from the beginning, time discretiza- 
tion errors which are severe in Hirsch-Fye algorithm^ do 
not have to be controlled any more. 

This continuous time method does not rely on an aux- 
iliary field decomposition and a particular partitioning of 
the Hamiltonian into "interacting" and "non-interacting" 
parts. In principles, the only requirement is that one may 
decompose the Hamiltonian in such a way that the time 
evolution associated with H a and the contractions of op- 
erators Hb may easily be evaluated. Thus there are sev- 
eral variations of CTQMC impurity solvers^ 6 - We note 
that all of the continuous time diagrammatic expansion 
algorithms are based on the same general idea, there are 
only significant differences in the specifics of how the ex- 
pansions are arranged, the measurements are done, and 
the errors are controlled. In this paper, we focus on the 
hybridization expansion algorithm merely. 

In the hybridization expansion algorithm, based on 
Eq.(JT} and Eq.©, Hb is taken to be the impurity-bath 
hybridization term Hhyb and H a = Hbath + Hi OCl where 



Hloc — Hii 



>-A a 



(3) 



Since 



= H 



hyb 
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(4) 



contains two terms which create and annihilate electrons 
on the impurity, respectively, only even powers of the ex- 
pansion and contributions with equal numbers of Hhyb 
and H^yb can yield a nonzero trace. Inserting the Hhyb 
and H hyb operators explicitly into Eq.([2]) and then sep- 
arating the bath operators (cj and c p ) and impurity op- 
erators (dj and dj), we finally obtain 
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^bath 
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(5) 



Here we define the bath partition function 
Zbath = Tre-P"-" = nil^ + 



a p 



(6) 



and A is a k x k matrix with elements A;. m = Aj, j m (t; 



r m ), where 



A 



yl*ym 



I III 



(t) = V — a 



Ar-P)^ T>0 

; e ° r , r < 0. 



(7) 



Next the diagrammatic Monte Carlo technique can be 
used to sample Eq.([S]). The two basic actions required 
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by ergodicity are the insertion and removal of a pair of 
creation and annihilation operators. Additional updates 
keeping the order k constant are typical not required for 
ergodicity but may speed up equilibrium and improve the 
sampling efficiency. 

In the Monte Carlo simulation, the most time consum- 
ing part is to calculate the following trace factor££ 



Wlc 



= E ^ 



bi...b k 
b'...bi 



(8) 



If Hi oc is diagonal in the occupation number basis de- 
fined by the d\ and d a operators, a separation of "flavors" 
(spin, site, orbital, etc.) as in the segment formalism 10 
is possible, and the computational efficiency is fairly sat- 
isfactory. Conversely, if Hi oc is not diagonal in the oc- 
cupation number basis, the calculation of wi oc becomes 
more involved and challenging. 

The conventional strategy proposed by Werner and 
Millis et al£.^ is to evaluate the trace factor (see Eq.©) 
in the eigenstate basis of Hi oc , because in this basis the 
time evolution operators e~ rHloa become diagonal and 
can be computed easily. On the other hand, in this rep- 
resentation the d a and d^ operators become complicated 
matrices. Hence this algorithm involves many multiplica- 
tions of matrices whose size is equal to the dimension of 
the Hilbert space of Hi oc . In order to facilitate the task 
of multiplying these operator matrices it is crucial to ar- 
range the eigenstates according to some carefully chosen 
good quantum numbers, 1 - Then the evaluation of the 
trace is reduced to block matrix multiplications. With 
this trick, the present state of the art is that five spin 
degenerate bands can be treated exactly. However, since 
the matrix blocks are dense and the largest blocks grow 
exponentially with system size, the simulation of bigger 
models becomes extreme expensive and is only doable if 
the size of the blocks is severely truncated. Various trun- 
cation and approximation schemes provide limited access 
to larger problems, but as the number of orbitals is in- 
creased the difficulties rapidly become insurmountable. 5,6 

Recently, Lauchli and Werner— present an implemen- 
tation of the hybridization expansion impurity solver 
which employs sparse matrix exact diagonalization tech- 
nique to compute the time evolution of the local Hamilto- 
nian Hi oc and then evaluate the weight of diagrammatic 
Monte Carlo configurations. They propose to adopt the 
occupation number basis, in which the creation and an- 
nihilation operators can easily be applied to any given 
states and in which the sparse nature of Hi oc matrix can 
be exploited during the imaginary time evolution by re- 
lying on mature Krylov subspace iteration methodJ^r— 
Their implementation is based on very efficient sparse 
matrix algorithm for the evaluation of matrix exponen- 
tials applied to a general vector, i.e., exp(— tHi oc )\v). At 
first the algorithm try to construct a Krylov subspace 

K p {W))=spm{\u), H loc \v), Hf oc \v), Hfjv)}, (9) 



by Lanczos or Arnoldi iterations, and then approximate 
the full matrix exponential of the Hamiltonian projected 
onto the Krylov subspace JC p (\v)). Here p means the 
dimension of the built Krylov subspace. It has been 
shown rigorously that these Krylov subspace iteration 
algorithms converge rapidly as a function of p^ Since 
this implementation involves only matrix- vector multipli- 
cations of the type d)\v), d\v), and Hi oc \v) with sparse 
operators d\ d, and symmetric matrix Hi oc , and is thus 
doable in principle even for systems for which the mul- 
tiplication of dense matrix blocks becomes prohibitively 
expensive or for which the matrix blocks will not even 
fit into the memory anymore. Their algorithm avoids 
computationally expensive matrix-matrix multiplications 
and becomes advantageous over the conventional imple- 
mentation for models with five or more bands. Lauchli 
et aL— have illustrated the power and usefulness of the 
Krylov subspace approach with dynamical mean-field re- 
sults for a given five-band model which captures some 
aspects of the physics of the iron-based superconductors. 



III. NEWTON-LEJA INTERPOLATION 
METHOD 

The spirit of our new implementation for hybridization 
expansion quantum impurity solver is quite similar with 
previous Krylov subspace approach^ We just adopt the 
occupation number basis and exploit the sparse nature 
of d) a and d a operators by applying them to any given 
states as well. But the kernel of our implementation is to 
evaluate the time evolution of sparse symmetric matrix 
Hi oc , i.e. exp(— tHi oc ), by Newton interpolation at a se- 
quence of real Leja points, instead of the Krylov subspace 
approach. Our algorithm inherits all the advantages of 
Krylov subspace approach, and is significantly superior to 
the latter on efficiency and memory consumption aspects. 
Consequently our implementation will be very promising 
to replace the Krylov subspace approach. 

How to efficiently evaluate the matrix exponentials of 
local Hamiltonian Hi oc applied to any given vectors, i.e., 
cxp(— tHi ov )\v), is the essential ingredient in hybridiza- 
tion expansion quantum impurity solver. We note that 
fast evaluation of the matrix exponential functions, just 
like exp(rA)v and (p(rA)v, is the key building block 
of the so-called "exponential integrators" in engineering 
mathematics and has received a strong impulse in recent 
yearsJ£-i£ Here A 6 TZ nxn , v G TZ n , t is arbitrary time 
step, and 



tp(z) 



exp(z) 



(10) 



To this respect, most authors regard Krylov-like as the 
methods of choice*^ Nevertheless, an alternative class 
of polynomial interpolation methods has been developed 
since the beginning^ which is based on direct interpo- 
lation or approximation of the matrix exponential func- 
tions on the eigenvalue spectrum (or the field of values) of 
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the relevant matrix A. Despite of a preprocessing stage 
needed to get a rough estimation of some marginal eigen- 
values, the latter is competitive with Krylov-like meth- 
ods in several instances, namely on large scale, sparse, 
and in general asymmetric matrices, usually arising from 
the spatial discretization of parabolic partial differential 
equations (PDEs)j2i 

Among others, the Newton interpolation based on real 
Leja points method (for a brief review of these two con- 
cepts, please refer to Appendix) has shown very attrac- 
tive computational features £lr— Given a general matrix 
A and a general vector v, Newton interpolation approxi- 
mates the exponential propagator as 

(P(tA)v ~p m (rA)v, (11) 

with m polynomial expansion order and p m {z) Newton 
interpolating polynomial of f(z) at a sequence of real 
Leja points {£/«} in a compact subset of the complex plane 
containing the eigenvalue spectrum of matrix _4.>2Ir— 

771 

p m (TA)v =5^d f fl J -t; 1 (12) 
3=0 

and 

fii = IlM-^' (13) 

k=0 

Here I is the unit matrix, and dj is the corresponding 
divided differences 2 ^ for <p(z). Observe that once <p(tA) 
is computed, then 

exp(r^l)u = tA<p(tA)v + v. (14) 

Thus in practice it is numerically convenient to interpo- 
late the function <p(tA) at first. 

By replacing general matrix A with —Hi oc and general 
vector v with and using Eq. (|ll[) - (I14I) . our algorithm 
rests on Newton interpolation of exp(— tHi ov )\v) at a se- 
quence of Leja points on the real focal interval, say [a, ft], 
of a suitable ellipse containing the eigenvalue spectrum 
of local Hamiltonian —Hi oc . The use of real Leja points 
is suggested by the fact that on such a well defined el- 
lipse they can give a stable interpolant, superlinearly con- 
vergent to entire functions due to the analogous scalar 
property ; 24 ' 26 i.e., 

lim sup\\(p(A)v -p m (A)v\\l /m = 0. (15) 

m— >oo 

In real arithmetic, a key step is given by estimating at 
low cost the reference focal interval [a, ft] for the eigen- 
value spectrum of —Hi oc matrix. In present works we 
adopt the simplest estimation given directly by the fa- 
mous Gershgorin's theorem. 

Now let us describe the full workflow for the trace eval- 
uation in some more details. In the initial stage of the hy- 
bridization expansion quantum impurity solver, the fol- 
lowing steps are necessary: (1) Calculate the low-lying 



eigenstates of Hi oc by Lanczos iteration algorithm or ex- 
act diagonalization technique, 1 Decide which eigenstates 
should be kept in the trace calculation and the other 
high-lying eigenstates are discarded. (2) Determine the 
real focal interval [a, ft] of a suitable ellipse containing 
the eigenvalue spectrum of matrix —Hi oc by using the 
Gershgorin's theorem. (3) Compute a sequence of real 
Leja points k = 0, ...,m — 1} on the interval [a, ft] 
by using the fast Leja points (FLPs) algorithm— and 
initialize the Newton-Leja algorithm. Depending on the 
simulation results gathered in present works the average 
degree for Newton interpolation m ~ 15, thus 64 real 
Leja points are enough to guarantee excellent conver- 
gence. Then, in the actual calculation of a trace factor, 
we proceed as follows: (4) Select an eigenstate as retained 
before, and propagate it to the first time evolution op- 
erator. Since the initial state is an eigenstate of Hi oc , 
it is simply multiplied by an exponential factor for the 
first time interval. (5) Apply the creation or annihilation 
operator on the propagated state by using the efficient 
sparse matrix-vector multiplication technique. (6) Prop- 
agate the current state to next time evolution operator 
using the Newton-Leja algorithm as described above. (7) 
Go back to step 5 if more creation and annihilation oper- 
ators are present. (8) Add the contribution of the prop- 
agated state to the trace. (9) Go back to step 4 until all 
retained eigenstates have been considered in the trace. 

The Newton-Leja algorithm turns out to be quite sim- 
ple and efficient, and its time complexity is very simi- 
lar with the Krylov subspace approach. According to 
Eq. (|12p and Eq. (|13[) . matrix- matrix multiplications are 
practically avoided, and the most important arithmetic is 
sparse matrix-vector multiplication. Furthermore, being 
based on vector recurrences in real arithmetic, its storage 
occupancy and computational cost are very small, and 
it results more efficient than Krylov subspace approach 
on large scale problems^ In addition, this algorithm is 
very well structured for a parallel implementation, as it 
has been demonstrated in the references^ It is worth to 
mention that except for the traditional parallelism strat- 
egy for random walking and Markov chain in the Monte 
Carlo algorithm, in a fine-grained parallelism algorithm 
the (4) ~ (8) steps can be easily parallelized over the re- 
tained eigenstates with multi-thread technique in mod- 
ern multi-core share memory computers. Further accel- 
eration by using CUDA-GPU technology is another very 
promising research area. In the next section we should 
address the performance and reliability issues of this new 
algorithm. 



IV. BENCHMARK 

In this section, we try to benchmark our new algo- 
rithm and compare the calculated results with other ex- 
isting implementations for hybridization expansion quan- 
tum impurity solvers. Three aspects, including trunca- 
tion error, reliability, and efficiency are mainly discussed. 
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FIG. 1. (Color online) Single particle Green's function G(t) 
of a three-band Hubbard model with j3 — 10 and U — 2.0 eV 
computed by the conventional matrix implementation (open 
squares, exact results) and the Newton-Leja algorithm (filled 
squares for O(l) level truncation and open circles for 0(2) 
level truncation of the trace). 

Just like Krylov subspace approach, the truncation ap- 
proximation can be adopted by our algorithm as well. 
And the high accuracy and superior performance of our 
algorithm are proved with extensive test cases. 

A spin degenerated three-band Hubbard model, with 
the following local Hamiltonian with rotationally invari- 
ant interactions 5 

Hloc = - - \i)n a .a + ^ Un a,i n a,l 

a, a a 

+ ^ W n a,c,n h - a + (U' - J)n a , a n b>cr ] 

(16) 

~ Yl J '( d i,l d b,t d b,i d a,T + h - C -) 
a<b 

- ^ J ' ( d b,t d b,l da ,t d a.,l + h.C.), 
a<b 

is used as a toy model to examine our implementa- 
tion. Here U' = U — 2J and Hund's coupling parameter 
J' = J = U/4. All the orbitals have equal bandwidth 4.0 
eV, and a semicircular density of states is chosen. The 
chemical potential fj, is fixed to keep the system at half 
filling and the crystal field splitting A a is set to be zero. 
Unless it is specifically stated, this model is used through- 
out this section. We solve this toy model in the frame- 
work of single site DMFT using variant implementations 
of hybridization expansion quantum impurity solvers, in- 
cluding conventional matrix implementation, 5 ' 6 Krylov 
subspace approach,— and our Newton-Leja algorithm. 

A. Truncation approximation 

In the calculation of trace factor, some high-lying 
eigenstates with negligible contributions can be aban- 
doned in advance to improve the computational effi- 



ciency. At low temperature region this truncation ap- 
proximation is practical, however, at high temperature 
region it must be used with great care. 

We solve the predefined three-band model (see 
Eg. ([To]) ) at extreme high temperature (/3 = 10, T ~ 1100 
K) and moderate interaction strength (U = 2.0 eV) to 
explore the influence of truncation approximation to the 
single particle Green's function. At first we can obtain 
the exact solutions by using the conventional matrix al- 
gorithm without any truncations. Then we run the sim- 
ulation again by using the Newton-Leja algorithm with 
O(l) level (in which only the 4-fold degenerated ground 
states are retained) and 0(2) level (in which not only 
the 4-fold degenerated ground states but also the 28-fold 
degenerated first excited states are kept) truncations re- 
spectively. 

The calculated results are shown in FigJTJ It is ap- 
parent that under O(l) level truncation there are signif- 
icantly systematic deviations between the approximate 
and exact results, while under 0(2) level truncation the 
deviations can be ignored safely. It has been suggested 
by Laiichli et alM- that the truncation approximation to 
the ground state vectors is legitimate just for tempera- 
tures which are < 1% of the bandwidth. Therefore in 
this case the temperature is too high to apply the 0(1) 
level truncation, but the 0(2) level truncation is still ac- 
ceptable. Finally, It should be pointed out that for this 
model the computational speed with 0(2) level trunca- 
tion is at least twice faster than that of full calculation 
without any truncations. 



B. Accuracy of Newton-Leja algorithm 

Next we try to demonstrate the accuracy of the new 
approach in a wider parameter range. The top panel of 
Fig|2]shows the measured single particle Green's function 
G(t) for /3 = 30 and different values of the Coulomb in- 
teraction strength U. Since the model temperature (T ~ 
390 K) meets the truncation criterion,—, both O(l) and 
0(2) level truncations are valid. Of course, the 0(2) 
level truncation can give finer results at the cost of per- 
formance, so in these simulations we apply the 0(1) level 
truncation merely. In this figure the open symbols were 
computed with the conventional matrix method without 
any truncations and used to test the precision of the 
Newton-Leja algorithm. Not surprisingly, essentially per- 
fect agreement between the two methods is found for all 
relevant interaction strengths. 

The bottom panel of Fig[5] illustrates the calculated 
results for U = 4.0 eV at different values of inverse tem- 
perature /3. The 0(1) level truncation is adopted for the 
Newton-Leja algorithm. The exact solutions obtained 
by traditional matrix method are shown as a compari- 
son. It is noticed that when the inverse temperature (3 
is 10, the deviations between traditional matrix imple- 
mentation and Newton-Leja algorithm are distinct. The 
temperature is lower, the smaller the deviation. And 
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FIG. 2. (Color online) Comparison between the single parti- 
cle Green's functions of a three-band model computed with 
the traditional matrix method (open symbols, without any 
truncations) and those computed with the Newton-Leja al- 
gorithm (full symbols, with O(l) level truncation). Upper 
panel: the calculated single particle Green's functions at dif- 
ferent interaction strengths U and /3 is fixed to 30. Lower 
panel: the calculated single particle Green's functions at U — 
4.0 eV and different inverse temperatures /3. 



for (3 > 20, the deviations can be considered negligible. 
In other words, their results become indistinguishable at 
temperatures which are < 1% of the bandwidth. Never- 
theless, the Newton-Leja algorithm with carefully chosen 
approximations is demonstrated to be controllable, reli- 
able and consistent with original implementations, and 
can be widely used in standard DMFT calculations. 



C. Efficiency of Newton-Leja algorithm 

Efficiency is always a major concern for newly devel- 
oped quantum impurity solver, so it is urgent for us to ex- 
plore the performance of Newton-Leja algorithm. Based 
on considerable testing, we found that when the system 
size is not large enough, the Newton-Leja algorithm is 
less efficiency than general matrix formulation^^ even 
the Krylov subspace approach, l? Whereas, when the sys- 
tem size is large enough, e.g., five-band or bigger model, 
the situation completely opposite. The Krylov subspace 
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FIG. 3. (Color online) Efficiencies (time consumption per 
evaluation of matrix exponentials exp(— AtHi oc )\v) is used 
as metric) of the Krylov subspace approach and Newton-Leja 
algorithm as a function of system size. The local Hamiltonian 
Hi oc with rotationally invariant interaction at half-filling is 
defined in Eq.fTB). where U = 4.0 eV and J/U = 0.25. At 
= 0.5 and 5.0 for typically short and long time intervals, re- 
spectively. The initial vector \v) is generated randomly. Both 
the maximum number of real Leja points for Newton-Leja 
algorithm m and maximum allowable dimension of Krylov 
subspace for Krylov approach p are 64. The results obtained 
by these two algorithms are consistent with each other within 
the machine precision. In the pink zone, the efficiency of 
Newton-Leja algorithm is clearly superior to Krylov subspace 
approach. 



approach outperforms the conventional matrix method 
and Newton-Leja algorithm is better than the Krylov 
subspace approach. In this subsection, we try to com- 
pare the efficiencies between Newton-Leja algorithm and 
Krylov subspace approach, and determine the system size 
for which the former is superior to the latter. 

Since the key building blocks for both Newton-Leja 
algorithm and Krylov subspace approach are the calcu- 
lations of time evolution operators, we just compare their 
efficiencies by the evaluation of exp(— AtHi oc )\v) . Here 
\v) is a randomly generated vector and time step At is 
set to be 0.5 and 5.0, corresponding to typically short 
and long time intervals, respectively. The multi-orbital 
local Hamiltonian H\ oc is constructed by using Eq.tfTB")). 
and the interaction strength U is fixed to be 4.0 eV and 
J' = J = U /4. Both the maximum allowable dimension 
for Krylov subspace p and number for real Leja points m 
are fixed to be 64, which are sufficient to obtain conver- 
gent and accurate results. 

In order to eliminate the influence of fluctuating mea- 
surement data, we repeated every benchmark for 20 times 
and then evaluated the average time consumption per 
evaluation of matrix exponentials. The benchmark re- 
sults for multi-orbital systems with n =1, 2, 7 (n 
labels the number of bands) are displayed in FigfS] As 
can be seen from the figure, when the system size is 
small or moderate (n < 4) the Newton-Leja algorithm 
exhibits worse performance than Krylov subspace ap- 
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FIG. 4. (Color online) Efficiencies (time consumption per 
evaluation of matrix exponentials exp(— AtHi oc ) \i>) is used 
as metric) of the Krylov subspace approach and Newton-Leja 
algorithm as a function of time interval for the five-band 
model with rotationally invariant interaction (U = 4.0 eV, 
J/U — 0.25). The chemical potential p is fixed to fulfill the 
half-filling condition. Randomly generated vector \v) is used 
as an initial state. The maximum number of real Leja points 
m is fixed to be 64, while the maximum allowable size of 
Krylov subspace p is varied from 16 to 64. The convergence 
criterion for all algorithms are the same. When At > 3.5, 
it is very difficult to obtain converged solutions for Krylov 
subspace approach by current settings. 



proach. However, when increasing the system size con- 
tinually (5 < n < 7, indicated in this figure by pink zone) 
the Newton-Leja algorithm is the winner. For instance, 
at n — 5 and Ar = 5.0 the Newton-Leja algorithm is 
almost ten times faster than the Krylov subspace ap- 
proach. More impressively, at n = 7 and Ar = 5.0, the 
Krylov subspace approach is slower than Newton-Leja al- 
gorithm about even four orders of magnitude. We note 
that for short time interval the performances for both 
implementations are close, but as for n > 5 the Newton- 
Leja algorithm still exhibits better performance. Accord- 
ing to these benchmarks, it is tentatively suggested that 
the Newton-Leja algorithm is much more suitable for the 
systems with five or more orbitals than Krylov subspace 
approach. 

Next we concentrate our attentions to the five-band 
model, which plays an important role in the underlying 
physics of transition metal compounds, and make further 
benchmarks for our new implementation. The average 
time consumption per evaluation of matrix exponentials 
exp(— AtHi oc )\v}) is used again as a measurement of ef- 
ficiency. The model parameters are consistent with pre- 
vious settings except the time interval Ar = 0.1 ~ 3.5. 
The maximum number of real Leja points m is fixed to 
be 64, while the maximum allowable dimension of Krylov 
subspace p is varied from 16 to 64. The benchmark re- 
sults are shown in Fig|4] As is illustrated in this figure, 
at short (Ar < 0.8) and long (Ar > 1.8) time inter- 
vals, the Newton-Leja algorithm shows better efficiency. 
It is apparent that when Ar > 2.0, actually no decline 



TABLE I. The average computational times per DMFT iter- 
ation for typical three-band and five-band Hubbard models 
solved by hybridization expansion quantum impurity solvers 
based on Newton-Leja algorithm and Krylov subspace ap- 
proach respectively. In current simulations, each DMFT it- 
eration took 40000000 Monte Carlo steps per processor, and 
the results were averaged over 16 processors. 



method 


three-band model five-band model 






U = 4.0 eV 


and = 20 


Newton-Leja, m = 


64 


3.23h 


29.07h 


Krylov, p — 64 




0.46h 


36.80h 






U = 4.0 eV 


and P = 40 


Newton-Leja, m = 


64 


6.12h 


67.32h 


Krylov, p — 64 




0.78h 


399.36h 



in efficiency for Newton-Leja algorithm is seen, i.e., the 
efficiency has nothing to do with the length of time in- 
terval. That is because the average degree of Newton 
interpolation (i.e. number of Leja points used in New- 
ton interpolation) almost remain unchanged. According 
to our experiences, m = 15 ~ 20 is suitable for most of 
the five, six, and seven-band models in general parameter 
ranges. As is mentioned above, provided limiting m and a 
well defined ellipse containing the eigenvalue spectrum of 
matrix —Hi oc , a stable interpolant, superlinearly conver- 
gent to matrix exponentials exp(— A.tHx oc )\v)) is guaran- 
teed by the Leja points method (see Eq.lJTSJl).— 12£ So we 
can decrease the maximum number of real Leja points m 
further to reduce the memory consumption and obtain 
higher efficiency. As for the Krylov subspace approach, 
at short time interval region the simulation with smaller 
Krylov subspace exhibits better performance, while at 
long time interval region the contrary is indeed true. It 
is very easy to be understood: at long time interval re- 
gion, the Krylov subspace algorithm requires larger di- 
mension of subspace to obtain fast convergence speed, 
and it is hardly to achieve convergence with small di- 
mension of Krylov subspace only when more iterations 
are done. However, it is impossible to increase the dimen- 
sion of Krylov subspace infinitely, the oversize of Krylov 
subspace will deteriorate the performance of Krylov ap- 
proach quickly. 

Finally, two concrete cases are provided to demon- 
strate the superior performance of Newton-Leja algo- 
rithm. Let's consider typical three-band and five-band 
Hubbard models respectively. The moderate Coulomb 
interaction strength U — 4.0 eV, and inverse temperature 
P = 20 or 40 (T - 580 K or 290 K) are chosen. These 
two models are solved separately by hybridization expan- 
sion quantum impurity solvers based on Newton-Leja al- 
gorithm and Krylov subspace approach with the same 
computational parameters, and the computational times 
are gathered and compared with each other. The bench- 
mark results are summarized in table HI It is confirmed 
again that for the three-band model Krylov subspace ap- 
proach is more efficient than Newton-Leja algorithm, yet 
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FIG. 5. (Color online) The single particle spectral functions 
for vanadium 3d states in SrV0 3 obtained by LDA+DMFT 
calculations. As for the Newton-Leja algorithm, the rotation- 
ally invariant interaction terms are taken into considerations, 
while for segment algorithm^ only the Ising-like density- 
density interaction terms are treated. The spectral function is 
obtained from imaginary time Green's function G(t) by using 
maximum entropy method^ and the calculated results are 
cross-checked by using recently developed stochastic analyti- 
cal continuation method.— Previous LDA+DMFT results,— 
in which traditional HFQMC is used as an impurity solver, 
are shown in this figure as a comparison. 

for the five-band model Newton-Leja algorithm exhibits 
much better performance, which is consistent with pre- 
vious benchmark results. Thus by any measure, our new 
implementation is better than Krylov subspace approach 
at low temperature and large system size. 



V. APPLICATION 

In this section, in order to illustrate the usefulness of 
Newton-Leja algorithm, we used it as a quantum impu- 
rity solver in the non-self-consistcnt LDA+DMFT calcu- 
lations for representative transition metal oxide SrV03. 
SrVC>3 is a well-known t\ e° metal. It is a good test 
case for LDA+DMFT calculations because it is cubic 
and nonmagnetic and also the t2 g bands are isolated 
from both e g and oxygen 2p bands in the LDA band 
structure. Numerous theoretical calculations (including 
LDA+DMFT)21"^£ and experiments 3 ^— have been done 
on this compound. This is thus an ideal system to bench- 
mark our implementation. 

The LDA+DMFT framework employed in present 
works has been described in the literatures ! 29 ' 37 ' 38 The 
ground state calculations have been carried out by us- 
ing the projector augmented wave (PAW) method with 
the abinit package. The cutoff energy for plane wave 
expansion is 20 Ha, and the fc-mesh for Brillouin zone 
integration is 12 x 12 x 12. The low-energy effective LDA 
Hamiltonian is obtained by applying a projection onto 
maximally localized Wannier function (MLWF) orbitals 
including all the vanadium 3d and oxygen 2p orbitals, 



which is described in details in reference [29[ . That would 
correspond to a 14 X 14 p — d Hamiltonian which is a 
minimal model^ 9 - required for a correct description of the 
electronic structure of SrVOa. 

The LDA+DMFT calculations presented below have 
been done for the experimental lattice constants (do = 
7.2605 a.u). All the calculations were preformed in para- 
magnetic state at the temperature of 1160 K (/? = 10). 
The Coulomb interaction is taken into considerations 
merely among vanadium 3d orbitals. In the present work, 
we choose U = 4.0 eV and J = 0.65 eV, which are accor- 
dance to previous LDA+DMFT calculations i^ 9 - We adopt 
the around mean-field (AMF) scheme proposed in refer- 
ence [29| to deal with the double counting energy. The 
effective quantum impurity problem for the DMFT part 
was solved by two implementations of hybridization ex- 
pansion CTQMC quantum impurity solver supplemented 
with recently developed orthogonal polynomial represen- 
tation algorithm.-^ The former implementation is based 
on the segment representation's, and only the Ising-like 
density-density interaction terms are treated. The lat- 
ter implementation is based on the Newton-Leja algo- 
rithm, and the local Hamiltonian is with rotationally 
invariant interactions (see Eq. (|16l) ). The maximum en- 
tropy method 3 - was used to perform analytical continua- 
tion to obtain the impurity spectral function from imag- 
inary time Green's function G(r) of vanadium 3d states. 
In the present simulations, each LDA+DMFT iteration 
took 40000000 Monte Carlo steps per process. Since the 
segment representation hybridization expansion impurity 
solver— is extreme efficient, it took less 5 hour to fin- 
ish 30 LDA+DMFT iterations by using a 8-cores Xeon 
CPU. Though 0(2) level approximation is adopted dur- 
ing the simulation, the Newton-Leja algorithm is much 
more time consumption and took about 12 ~ 13 hours 
to finish single LDA+DMFT iteration by using 64 cores 
in a Xeon cluster. 

The calculated single particle spectral functions for 
vanadium 3c? states in SrVC"3 are shown in FigO The 
calculated results within density-density interaction are 
consistent with previous LDA+DMFT simulations <22r— 
But the results obtained by classical segment representa- 
tion and Newton-Leja algorithm display remarkable dif- 
ferences. For examples, the upper and lower Hubbard 
bands in ti g sub-bands obtained by Newton-Leja algo- 
rithm with rotationally invariant interactions are more 
apparent. And the quasiparticle resonance peak around 
the Fermi level exhibits clearly shoulder structure, while 
that shoulder peak is absent in the segment picture 
(density-density interaction case) . Since the vanadium e g 
sub-bands are isolated from ti g states and located above 
the Fermi level, so they show roughly similar peak struc- 
tures for both two implementations of hybridization ex- 
pansion quantum impurity solvers. Nevertheless, based 
on our calculated results, it is suggested that the spin- 
flip and pair-hopping terms in rotationally invariant in- 
teractions may play a key role in understanding the sub- 
tle electronic structure of SrV03 around the Fermi level, 
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FIG. 6. (Color online) The first nine asymmetric real Leja 
points in [-2:2] interval generated by "Fast Leja Points" 
algorithm. 26 The labels denote the sequences of Leja points. 



which has been ignored by previous theoretical calcula- 
tions. 



VI. CONCLUSIONS 

We have presented an alternate implementation of the 
hybridization expansion quantum impurity solver which 
makes use of the Newton-Leja interpolation to evaluate 
the weight of Monte Carlo configurations. The new im- 
plementation inherits all the advantages of previously 
developed Krylov subspace approach with less memory 
consumption and better convergence control. It shows 
tremendous growth in computational performance over 
Krylov subspace approach and conventional matrix im- 
plementation at low temperature and large system size, 
and provides a controlled and efficient way which enables 
the LDA+DMFT study of transition metal compounds, 
or even actinide compounds with realistic interactions. 
To demonstrate the power of the new implementation, 
we used it as an impurity solver in the LDA+DMFT 
calculations for typical strongly correlated metal SrVC>3. 
The obtained impurity spectral function of vanadium tig 
states shows apparent distinctions with previous calcu- 
lated results. It is argued that the full Hund's exchange 
may has a big impact on the fine energy spectrum near 
the Fermi level. The generalization of Newton-Leja algo- 
rithm to high performance CUDA-GPU architecture is 
underway. 
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Appendix A: Leja points 

For the reader's convenience, here we briefly recall the 
definition of real Leja points. Sequences of Leja points 
{£?' l^o f° r the compact set K are defined recursively as 
follows: if £0 is an arbitrary fixed point in K, then £j are 
generated recursively in such a way that 

j-i j-i 

II & - ^1 = max J] |C - Cfcl, j = 1, 2, •• • . (Al) 

k=0 fc=0 

An efficient algorithm for computing a sequence of real 
Leja points, the so-called "Fast Leja Points (FLPs)", has 
been proposed in reference |26j |. In Fig |6] the first nine 
Leja points in [-2,2] interval are shown as a illustration. 
The Leja sequences are attractive for interpolation at 
high-degree, in view of the stability of the correspond- 
ing algorithm in the Newton interpolation^ 

Appendix B: Newton interpolation 

Given a set of m + 1 data points 

(x ,yo),...,(x m ,y m ) (Bl) 

where no two xj are the same, the so-called Newton in- 
terpolation polynomial is defined as follows 

m 

p m (x) = ^djttjix), (B2) 

3=0 

with dj the divided difference^ and the Newton basis 
polynomial £lj(x) defined as 

j-i 

^i{x) = Y[(x-x k ). (B3) 

fe=0 

It is apparent that Eq. (Tp2")) and Eq. (TT3l are the matrix 
forms of Eq. (|B2|) and Eq. (|B3[) . respectively. The New- 
ton form is an attractive representation for interpolation 
polynomials because it can be determined and evaluate 
rapidly, and, moreover, it is easy to determine p m +i if 
p m is already known. Because the Leja points are defined 
recursively, they are attractive to use with the Newton 
interpolation. 
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